You are encouraged to discuss problem sets with your fellow students (and with the Course Instructor of course), but you must write your own final answers, in your own words. Solutions prepared "in committee" or by copying someone else's paper are not acceptable. This violates the Brown standards of plagiarism, and you will not have the benefit of having thought about and worked the problem when you take the examinations.
All answers must be in complete sentences and all graphs must be properly labeled.
For the PDF Version of this assignment: PDF
For the R Markdown Version of this assignment: RMarkdown
Please turn the homework in through Canvas. You may use a PDF, HTML, or Word doc file to turn the assignment in.
This homework will use the following data:
Y, jointly even though they do not necessarily predict Y very well individually. There are two predictor variables in the data set, X1 and X2. We can see that it is hard to tell whether there is a relationship between X1 and Y. However, there does appear to be a positive linear relationship between X2 and Y. We can also see a strong relationship between X1 and X2.
Y. Comment on whether the predictors seem to relate to Y. What percent of the variability in Y does each predictor explain by itself?

| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | -0.0041133 | 0.6815769 | -0.0237633 | 0.0155367 |
| x1 | 0.0082620 | 0.4105894 | -0.0114188 | 0.0279428 |
| (Intercept) | -0.0032395 | 0.7190647 | -0.0208921 | 0.0144132 |
| x2 | 0.4393635 | 0.0000000 | 0.4217521 | 0.4569748 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.0000677 | -0.0000323 | 1.0024504 | 0.6771501 | 0.4105894 |
| 0.1930245 | 0.1929438 | 0.9005499 | 2391.4715062 | 0.0000000 |
We can see in fit.x1 that X1 does not have a significant relationship with Y; the model explains less than 1% of the variation in Y. For fit.x2 we can see that X2 has a significant positive relationship with Y; this model explains 19.3% of the variation in Y.
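Tables like the ones above come from two calls to `lm()`. The sketch below is a self-contained illustration: the course data are not shown here, so the data frame `dat`, the seed, and the generating coefficients are all assumptions chosen only to mimic the pattern in the output (weak X1, moderate X2).

```r
# Simulated stand-in for the course data (names and coefficients are assumptions)
set.seed(1)
dat <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
dat$y <- 0.44 * dat$x2 + rnorm(1000)

fit.x1 <- lm(y ~ x1, data = dat)  # simple regression of y on x1
fit.x2 <- lm(y ~ x2, data = dat)  # simple regression of y on x2

summary(fit.x1)$r.squared  # share of variation in y explained by x1 alone (small)
summary(fit.x2)$r.squared  # share explained by x2 alone (modest)
confint(fit.x2)            # confidence interval for the x2 slope, as in the table
```

`summary()$r.squared` and `confint()` recover the same quantities reported in the tables above for the real data.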
lm() to build a multiple regression model using both predictor variables X1 and X2. Comment on the fit and the statistical significance of each predictor variable. What percent of the variability in Y is explained by the model now that both predictors are included? Give an explanation for what you think is happening with both predictors in the model.

| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | -0.0002643 | 0.8542291 | -0.003084 | 0.0025555 |
| x1 | -2.0397378 | 0.0000000 | -2.046208 | -2.0332674 |
| x2 | 2.2674010 | 0.0000000 | 2.260956 | 2.2738461 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.979412 | 0.9794078 | 0.143849 | 237788.1 | 0 |
We can see that when we include X1 and X2 in the model together, X2 remains significant but has a larger estimated effect at a given level of X1. This time X1 has a significant effect, and it is negative at a given level of X2. Together they explain 97.94% of the variation in Y. It is hard to say exactly what is going on without investigating further, but it seems that controlling for X1 and X2 at the same time lets us account for much more of the variation in Y.
With X1 there is both a qualitative (direction) and a quantitative (magnitude) change. With X2 there is only a quantitative (magnitude) change.
X1 and X2 predict Y so well together when they do not alone?

We can see that the relationship among all of these variables is actually a plane. This suggests that more than one variable alone is required to explain the variability in Y. Everything we have seen shows that at the same level of X1, an increase in X2 leads to an increase in Y; however, at the same level of X2, an increase in X1 leads to a decrease in Y. Partitioning the data this way gives us a much better accounting of the variance of Y.
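This pattern — each predictor weak alone, nearly perfect together — is easy to reproduce with simulated data. The sketch below is an illustration only, not the actual course data; the seed, sample size, and coefficients are assumptions chosen so that X2 nearly cancels X1's effect when either is taken alone.

```r
# Construct x2 so that, marginally, it almost cancels x1's effect on y
set.seed(2510)
x1 <- rnorm(500)
x2 <- 0.88 * x1 + rnorm(500, sd = 0.3)   # x1 and x2 strongly related
y  <- -2 * x1 + 2.27 * x2 + rnorm(500, sd = 0.15)

summary(lm(y ~ x1))$r.squared       # close to 0: x1 alone explains little
summary(lm(y ~ x2))$r.squared       # modest: x2 alone explains some
summary(lm(y ~ x1 + x2))$r.squared  # close to 1: together they explain almost all
```

The key is that the slope of y on x1 alone is roughly \(-2 + 2.27 \times 0.88 \approx 0\), so the two effects offset each other until both predictors are in the model.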
Data set hw1b contains air pollution data from 41 U.S. cities. Our goal is to try to build a multiple regression model to predict SO2 concentration using the other variables.
| Variable Name | Description |
|---|---|
| so2 | SO2 air concentration in micrograms per cubic meter. |
| temp | Average annual temperature in degrees F. |
| empl20 | The number of manufacturing companies with 20 or more workers. |
| pop | The population in thousands. |
| wind | The average annual wind speed in miles per hour. |
| precipin | The average annual precipitation in inches. |
| precipdays | The average number of days with precipitation per year. |
| Variable | Minimum | 1st Qu. | Median | Mean | 3rd Qu. | Maximum |
|---|---|---|---|---|---|---|
| so2 | 8 | 13 | 26 | 30.05 | 35 | 110 |
| temp | 43.5 | 50.6 | 54.6 | 55.76 | 59.3 | 75.5 |
| empl20 | 35 | 181 | 347 | 463.1 | 462 | 3344 |
| pop | 71 | 299 | 515 | 608.6 | 717 | 3369 |
| wind | 6 | 8.7 | 9.3 | 9.444 | 10.6 | 12.7 |
| precipin | 7.05 | 30.96 | 38.74 | 36.77 | 43.11 | 59.8 |
| precipdays | 36 | 103 | 115 | 113.9 | 128 | 166 |
We can see from the table above that the number of manufacturing companies with 20 or more workers seems to have some extreme values: 75% of the data falls at 462 and below, yet the maximum is 3344. Population also seems to have some extreme values, since 75% of the values are at or below 717 but the maximum is 3369. We will evaluate these two further with boxplots.
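As a quick numerical check, using only the quartiles reported in the summary table above, the usual 1.5 × IQR upper fence for empl20 sits far below its maximum:

```r
# 1.5*IQR rule applied to the reported quartiles of empl20
q1 <- 181
q3 <- 462
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence  # 883.5, well below the maximum of 3344
```

The same check for pop gives 717 + 1.5 × (717 − 299) = 1344, again far below its maximum of 3369, which supports treating both variables as having extreme values.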
## [1] 11 29 31
## [1] 9
## [1] 11 18 27 29
## [1] 11 18 29
## [1] 1 23
## [1] 1 23 25
From the above graphs we can see that there do appear to be some extreme values in population and in the number of companies with 20 or more employees. Record 11 seems to be the largest of these values.
Another way to find the outliers (recall this from the second set of notes in 1510/2510):
outliers of so2
which(hw1b$so2 > 1.5 * IQR(hw1b$so2) + quantile(hw1b$so2, 0.75) | hw1b$so2 < quantile(hw1b$so2, 0.25) - 1.5 * IQR(hw1b$so2))
## [1] 11 29 31
outliers of temp
which(hw1b$temp > 1.5 * IQR(hw1b$temp) + quantile(hw1b$temp, 0.75) | hw1b$temp < quantile(hw1b$temp, 0.25) - 1.5 * IQR(hw1b$temp))
## [1] 9
outliers of empl20
which(hw1b$empl20 > 1.5 * IQR(hw1b$empl20) + quantile(hw1b$empl20, 0.75) | hw1b$empl20 < quantile(hw1b$empl20, 0.25) - 1.5 * IQR(hw1b$empl20))
## [1] 11 18 27 29
outliers of pop
which(hw1b$pop > 1.5 * IQR(hw1b$pop) + quantile(hw1b$pop, 0.75) | hw1b$pop < quantile(hw1b$pop, 0.25) - 1.5 * IQR(hw1b$pop))
## [1] 11 18 29
outliers of wind
which(hw1b$wind > 1.5 * IQR(hw1b$wind) + quantile(hw1b$wind, 0.75) | hw1b$wind < quantile(hw1b$wind, 0.25) - 1.5 * IQR(hw1b$wind))
## integer(0)
outliers of precipin
which(hw1b$precipin > 1.5 * IQR(hw1b$precipin) + quantile(hw1b$precipin, 0.75) | hw1b$precipin < quantile(hw1b$precipin, 0.25) - 1.5 * IQR(hw1b$precipin))
## [1] 1 23
outliers of precipdays
which(hw1b$precipdays > 1.5 * IQR(hw1b$precipdays) + quantile(hw1b$precipdays, 0.75) | hw1b$precipdays < quantile(hw1b$precipdays, 0.25) - 1.5 * IQR(hw1b$precipdays))
## [1] 1 23 25
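The seven `which()` calls above all apply the same 1.5 × IQR rule, so the repetition can be factored into a small helper function. This helper is hypothetical (not part of the assignment), shown here only to illustrate the pattern:

```r
# Indices of values outside the 1.5*IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  r <- 1.5 * IQR(x)
  which(x > q[2] + r | x < q[1] - r)
}

iqr_outliers(c(1, 2, 3, 2, 100))  # flags index 5, the extreme value
```

With the course data, calls such as `iqr_outliers(hw1b$so2)` should reproduce the index vectors printed above.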
| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| temp | -1.4081325 | 0.0046245 | -2.3559546 | -0.4603104 |
| empl20 | 0.0268587 | 0.0000054 | 0.0165457 | 0.0371718 |
| pop | 0.0200136 | 0.0010350 | 0.0085979 | 0.0314293 |
| wind | 1.5557412 | 0.5559350 | -3.7417772 | 6.8532595 |
| precipin | 0.1082620 | 0.7360031 | -0.5366161 | 0.7531401 |
| precipdays | 0.3272603 | 0.0174044 | 0.0607506 | 0.5937700 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.1880091 | 0.1671889 | 21.42044 | 9.0300972 | 0.0046245 |
| 0.4157267 | 0.4007453 | 18.17025 | 27.7495857 | 0.0000054 |
| 0.2438183 | 0.2244290 | 20.67121 | 12.5749043 | 0.0010350 |
| 0.0089663 | -0.0164448 | 23.66448 | 0.3528489 | 0.5559350 |
| 0.0029479 | -0.0226176 | 23.73623 | 0.1153071 | 0.7360031 |
| 0.1365773 | 0.1144382 | 22.08842 | 6.1690682 | 0.0174044 |
We can see that temperature has a significant negative estimated effect, explaining about 19% of the variation in SO2 levels. The number of companies with 20+ employees has a highly significant positive relationship that explains about 42% of the variation in SO2 levels. Population has a significant positive relationship that explains about 24% of the variation. Average annual wind speed has an insignificant relationship and explains about 1% of the variation. Average annual precipitation has an insignificant relationship and explains less than 1% of the variation. Average annual days of precipitation has a significant positive relationship and explains about 14% of the variation.
From our discussion above, we will begin with the two variables with the best fit: companies with 20+ employees (empl20) and population (pop).
| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 26.3250833 | 0.0000000 | 18.5505206 | 34.0996460 |
| empl20 | 0.0824341 | 0.0000020 | 0.0526825 | 0.1121857 |
| pop | -0.0566066 | 0.0003192 | -0.0855548 | -0.0276584 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.5863202 | 0.5645476 | 15.48908 | 26.92924 | 1e-07 |
We can see from the model summary that both empl20 and pop have significant effects on SO2. empl20 shows a positive relationship given pop, which is actually larger than it was before adjusting for pop. When adjusting for empl20, population actually has a negative relationship with SO2. From here I will try adding in temperature.
| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 58.1959320 | 0.0072804 | 16.6835168 | 99.7083472 |
| empl20 | 0.0712252 | 0.0000796 | 0.0386842 | 0.1037661 |
| pop | -0.0466475 | 0.0043900 | -0.0777939 | -0.0155011 |
| temp | -0.5871451 | 0.1220314 | -1.3388780 | 0.1645878 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.6125468 | 0.5811317 | 15.19126 | 19.49847 | 1e-07 |
With the addition of temperature, the adjusted \(R^2\) increased only a small amount, from 0.5645 to 0.5811. This would suggest that temperature does add something.
I proceeded by checking different models in this fashion. Below I will discuss the final model I chose.
| term | estimate | p.value | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 100.1524457 | 0.0021815 | 38.6905106 | 161.6143809 |
| empl20 | 0.0648871 | 0.0001881 | 0.0333293 | 0.0964450 |
| pop | -0.0393347 | 0.0124987 | -0.0696575 | -0.0090119 |
| precipin | 0.4194681 | 0.0604983 | -0.0195319 | 0.8584681 |
| wind | -3.0823996 | 0.0896221 | -6.6668052 | 0.5020060 |
| temp | -1.1212877 | 0.0107070 | -1.9655351 | -0.2770403 |
| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.6685085 | 0.6211526 | 14.44732 | 14.11668 | 1e-07 |
Your answer may differ from mine, but here is my reasoning for why I chose my model. I chose to go back to my model with the predictors pop, empl20, precipin, wind, and temp. This model has an overall \(R^2\) of 0.6685, so it explains about 67% of the variation. If you compare this model to other subsets, you will find that it also has the largest adjusted \(R^2\) of them. We can see that population and wind change both in magnitude and in direction of effect. Even though precipin and wind are not significant in this model, keeping them allows me to explain about 6% more variation than if I left them out.
Adjusted \(R^2\) is a way for us to compare models: it penalizes \(R^2\) for each extra variable added, so it only increases when a new variable improves the fit more than would be expected by chance. Here, the adjusted \(R^2\) indicates that the model fit is reasonably good.
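Concretely, adjusted \(R^2\) shrinks \(R^2\) by a factor that grows with the number of predictors \(p\): \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\). The short sketch below reproduces the final model's adjusted \(R^2\) from the numbers in the table above (41 cities, 5 predictors):

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Final model: R^2 = 0.6685085, n = 41 cities, p = 5 predictors
adj_r2(0.6685085, 41, 5)  # 0.6211526, matching the model summary table
```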
If we look at the p-values in the table above, we can see that empl20 is significant with a p-value of 0.0002. We can also see that pop is significant with a p-value of 0.012. Finally, temp is significant with a p-value of 0.011. The rest have p-values over 0.05.